Constructing Corpora of South Asian Languages
Abstract
The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted.
Similar resources
Developing Asian language corpora: standards and practice
This paper first discusses standards for developing Asian language corpora so as to facilitate international data exchange. Following this, we present two corpora of Asian languages developed at Lancaster University – the EMILLE Corpus, which contains 14 South Asian languages, and the Lancaster Corpus of Mandarin Chinese. Finally, we will demonstrate how to explore these corpora using Xara and ...
Comparison of Unigram, Bigram, HMM and Brill’s POS Tagging Approaches for some South Asian Languages
Part-of-Speech (POS) Tagging is a process that attaches each word in a sentence with a suitable tag from a given set of tags. POS Tagging is important in various areas of Natural Language Processing. Different methods of automating the process have been developed and employed for English and other Western languages. Some similar work, most of which utilize the stochastic approaches for POS Tagg...
Maximum Entropy Approach for Named Entity Recognition in Bengali and Hindi
This paper reports about the development of a Named Entity Recognition (NER) system in two leading Indian languages, namely Bengali and Hindi using the Maximum Entropy (ME) framework. We have used the annotated corpora, obtained from the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL) and tagged with a fine-grained Named Entity (NE) tagset of twelve tags. An appropr...
Aligning Sentences and Words Using English-Hindi Bilingual Parallel Corpora
This dissertation project relates to language engineering issues. The Enabling Minority Language Engineering (EMILLE) project is a collaboration between the University of Sheffield and Lancaster University. It aims to develop a sixty-three-million-word electronic corpus of South Asian languages. As part of the EMILLE project, it was decided to develop a POS tagger for one of the languages...
Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
Recent work has established the efficacy of Amazon’s Mechanical Turk for constructing parallel corpora for machine translation research. We apply this to building a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena tha...
Publication date: 2003